Skip to content

BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 #12403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

multiloc
Copy link
Contributor

@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

if self.is_null and not getattr(self.block, 'is_categorical',
None):
if isinstance(empty_dtype, DatetimeTZDtype):
# handle timezone-aware all NaT cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if u can make this if else flow s bit more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i mean is make this an else-if so this cleanly handles the different dtypes (categorical, datetime with tz) and is obvious what is going on

@multiloc
Copy link
Contributor Author

Could do

if self.is_null and not getattr(self.block, 'is_categorical',
                                None):
    num_elements = np.prod(self.shape)
    if isinstance(empty_dtype, DatetimeTZDtype):
        # handle timezone-aware all NaT cases
        return DatetimeIndex([fill_value] * num_elements,
                             dtype=empty_dtype)
    else:
        missing_arr = np.empty(self.shape, dtype=empty_dtype)
        if num_elements:
            # ...
            missing_arr.fill(fill_value)
        return missing_arr

@jreback jreback changed the title BUG fix GH12396: BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 Feb 22, 2016
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Feb 22, 2016
@jreback jreback added this to the 0.18.0 milestone Feb 22, 2016
@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

if self.is_null and not getattr(self.block, 'is_categorical',
None):
if isinstance(empty_dtype, DatetimeTZDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also blow away the numpy 1.6 workaround comment / code.

@jreback
Copy link
Contributor

jreback commented Feb 22, 2016

@hcontrast see what you can do here. This involves cleaning up some code and making it more generic to handle the extension dtypes. lmk if you get stuck.

@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 23, 2016
@multiloc
Copy link
Contributor Author

Ok. Will have to wait until the weekend though

@jreback
Copy link
Contributor

jreback commented Mar 23, 2016

I modified some things in #12702

pls update/rebase when that's merged

@multiloc
Copy link
Contributor Author

Rebased on top of master. Having trouble running tests locally though

dtype = DatetimeTZDtype('ns', tz='UTC')
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)

result = pd.concat([first, second], 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always write out parameters axis=0 if you are passing them.

@jreback
Copy link
Contributor

jreback commented Apr 18, 2016

sorry, meant here. can you rebase/update?

@multiloc
Copy link
Contributor Author

Updated the tests, but the mixed tz-/non-tz-aware cases are now breaking again:

======================================================================
ERROR: test_concat_NaT_dataframes_all_NaT_axis_0 (pandas.tools.tests.test_merge.TestConcatenate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../dev/pandas/pandas/tools/tests/test_merge.py", line 2606, in test_concat_NaT_dataframes_all_NaT_axis_0
    result = pd.concat([first, second], axis=0)
  File "/.../dev/pandas/pandas/tools/merge.py", line 836, in concat
    return op.get_result()
  File "/.../dev/pandas/pandas/tools/merge.py", line 1024, in get_result
    concat_axis=self.axis, copy=self.copy)
  File "/.../dev/pandas/pandas/core/internals.py", line 4544, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/.../dev/pandas/pandas/core/internals.py", line 4641, in concatenate_join_units
    for ju in join_units]
  File "/.../dev/pandas/pandas/core/internals.py", line 4907, in get_reindexed_values
    missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood

where

# pandas/tools/tests/test_merge.py:2601

# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)

result = pd.concat([first, second], axis=0)

Looks like accidental/eager upcasting on closer inspection

block:  DatetimeTZBlock: slice(0, 1, 1), 1 x 2, dtype: datetime64[ns, UTC]
is_datetime_tz:  True
empty dtype:  datetime64[ns, UTC]
block:  DatetimeBlock: slice(0, 1, 1), 1 x 1, dtype: datetime64[ns]
is_datetime_tz:  False
empty dtype:  datetime64[ns, UTC]

The second block is regular datetime but gets passed a tz-aware empty value, hence triggering the wrong case.

@jreback
Copy link
Contributor

jreback commented Apr 18, 2016

ok I just pushed a change where we moved all of the concat stuff together (now in pandas/types/concat.py, was mostly from core/common.py).

use this to handle the path

com.is_extension_type(empty_dtype)
True

@multiloc
Copy link
Contributor Author

Thanks. I've applied that fix, only outstanding issue is the mixed Series-DF case, which doesn't preserve the dtype information correctly:

FAIL: test_concat_NaT_series_dataframe_all_NaT (pandas.tools.tests.test_merge.TestConcatenate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../dev/pandas/pandas/tools/tests/test_merge.py", line 2696, in test_concat_NaT_series_dataframe_all_NaT
    assert_frame_equal(result, expect, check_dtype=True)
  File "/.../dev/pandas/pandas/util/testing.py", line 1166, in assert_frame_equal
    obj='DataFrame.iloc[:, {0}]'.format(i))
  File "/.../dev/pandas/pandas/util/testing.py", line 1026, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/.../dev/pandas/pandas/util/testing.py", line 804, in assert_attr_equal
    left_attr, right_attr)
  File "/.../dev/pandas/pandas/util/testing.py", line 915, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  object
[right]: datetime64[ns]

    def test_concat_NaT_series_dataframe_all_NaT(self):
        # GH 12396

        # non-timezone aware
        first = pd.Series([pd.NaT, pd.NaT])
        second = pd.DataFrame([[pd.Timestamp('2015/01/01')],
                               [pd.Timestamp('2016/01/01')]])

        expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
                               second.iloc[1, 0]], index=[0, 1, 0, 1])

        result = pd.concat([first, second])
        assert_frame_equal(result, expect)

        # one side timezone-aware
        dtype = DatetimeTZDtype('ns', tz='UTC')
        first = first.astype(dtype)
        result = pd.concat([first, second])
        assert_frame_equal(result, expect, check_dtype=True)
        self.assertEqual(result.dtypes.iloc[0], second.dtypes[0])

        # both sides timezone-aware
        second = second.apply(lambda x: x.astype(dtype))

        result = pd.concat([first, second])
        assert_frame_equal(result, expect, check_dtype=True)
        self.assertEqual(result.dtypes.iloc[0], second.dtypes[0])

@jreback
Copy link
Contributor

jreback commented Apr 18, 2016

that last already seems to work

@multiloc
Copy link
Contributor Author

Might be due to the patch then

@@ -4911,6 +4911,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
pass
elif getattr(self.block, 'is_sparse', False):
pass
elif com.is_extension_type(empty_dtype):
num_elements = np.prod(self.shape)
# handle timezone-aware all NaT cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do this if it's a datetime tz type

elif com.is_extension_type(empty_dtype) and \
com.is_datetimetz(empty_dtype):
num_elements = np.prod(self.shape)
# handle timezone-aware all NaT cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, do this under the if getattr(self.block, 'is_datetimetz', False) or com.is_datetimetz(empty_dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like so? This passes the tests but not sure how good the coverage is for the non-NaT tz-aware cases.

                if getattr(self.block, 'is_datetimetz', False) \
                    or com.is_datetimetz(empty_dtype):
                    num_elements = np.prod(self.shape)
                    # handle timezone-aware all NaT cases
                    return DatetimeIndex([fill_value] * num_elements,
                                         dtype=empty_dtype)
                elif getattr(self.block, 'is_categorical', False):
                    pass
                elif getattr(self.block, 'is_sparse', False):
                    pass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes more like this. yes coverage of extension types is prob only so now

@codecov-io
Copy link

codecov-io commented May 8, 2016

Current coverage is 84.14%

Merging #12403 into master will increase coverage by +<.01%

@@             master     #12403   diff @@
==========================================
  Files           137        137          
  Lines         50287      50288     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42311      42313     +2   
+ Misses         7976       7975     -1   
  Partials          0          0          
  1. File ...das/tseries/index.py (not in diff) was modified. more
    • Misses -1
    • Partials 0
    • Hits +1

Powered by Codecov. Last updated by af7bdd3...10406df

@jreback
Copy link
Contributor

jreback commented May 20, 2016

ok (this is the original issue)

In [1]: df1 = pd.DataFrame([[pd.NaT], [pd.NaT]])

In [2]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='UTC')]])

In [4]: pd.concat([df1, df2])
Out[4]: 
                          0
0                       NaT
1                       NaT
0 2015-01-01 00:00:00+00:00
1 2016-01-01 00:00:00+00:00

In [5]: pd.concat([df1, df2]).dtypes
Out[5]: 
0    datetime64[ns, UTC]
dtype: object

This is incorrect; we are coercing to object (correct), but the NaT are getting incorrectly corced to nan here.

In [6]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

In [7]: pd.concat([df1, df2]).dtypes
Out[7]: 
0    object
dtype: object

In [8]: pd.concat([df1, df2])
Out[8]: 
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

@jreback
Copy link
Contributor

jreback commented May 31, 2016

@hcontrast can you update. only small fix I think.

@multiloc
Copy link
Contributor Author

Added test case for this, indeed still breaking

@multiloc
Copy link
Contributor Author

Can't actually get the test to break. This should break but doesn't

        first = pd.DataFrame([[pd.NaT], [pd.NaT]])
        second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')],
                               [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

        expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
                               second.iloc[1, 0]], index=[0, 1, 0, 1])

        result = pd.concat([first, second], axis=0)
        assert_frame_equal(result, expect)
        print expect
        print result

gives

                           0
0                        NaT
1                        NaT
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

@jreback
Copy link
Contributor

jreback commented May 31, 2016

This is a tricky one; we have this flag to test exactness (so do this for now).

In [27]: tm.assert_frame_equal(result, expect, check_datetimelike_compat=True)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-27-441a8feab35d> in <module>()
----> 1 tm.assert_frame_equal(result, expect, check_datetimelike_compat=True)

/Users/jreback/pandas/pandas/util/testing.pyc in assert_frame_equal(left, right, check_dtype, check_index_type, check_column_type, check_frame_type, check_less_precise, check_names, by_blocks, check_exact, check_datetimelike_compat, check_categorical, check_like, obj)
   1268                 check_datetimelike_compat=check_datetimelike_compat,
   1269                 check_categorical=check_categorical,
-> 1270                 obj='DataFrame.iloc[:, {0}]'.format(i))
   1271 
   1272 

/Users/jreback/pandas/pandas/util/testing.pyc in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1131         else:
   1132             assert_numpy_array_equal(left.get_values(), right.get_values(),
-> 1133                                      check_dtype=check_dtype)
   1134     else:
   1135         _testing.assert_almost_equal(left.get_values(), right.get_values(),

/Users/jreback/pandas/pandas/util/testing.pyc in assert_numpy_array_equal(left, right, strict_nan, check_dtype, err_msg, obj)
   1034     # compare shape and values
   1035     if not array_equivalent(left, right, strict_nan=strict_nan):
-> 1036         _raise(left, right, err_msg)
   1037 
   1038     if check_dtype:

/Users/jreback/pandas/pandas/util/testing.pyc in _raise(left, right, err_msg)
   1028             msg = '{0} values are different ({1} %)'\
   1029                 .format(obj, np.round(diff, 5))
-> 1030             raise_assert_detail(obj, msg, left, right)
   1031 
   1032         raise AssertionError(err_msg)

/Users/jreback/pandas/pandas/util/testing.pyc in raise_assert_detail(obj, message, left, right)
    983 [left]:  {2}
    984 [right]: {3}""".format(obj, message, left, right)
--> 985     raise AssertionError(msg)
    986 
    987 

AssertionError: numpy array are different

numpy array values are different (50.0 %)
[left]:  [nan, nan, 2015-01-01 00:00:00+00:00, 2016-01-01 00:00:00-05:00]
[right]: [NaT, NaT, 2015-01-01 00:00:00+00:00, 2016-01-01 00:00:00-05:00]

@sinhrks you think we should be checking for this by default? (I forgot why I added this, IIRC maybe for stata tests or something)

@sinhrks
Copy link
Member

sinhrks commented Jun 13, 2016

Or we should change assert_numpy_array_equal to use strict_nan=True. Not sure how many tests are relying on this.

@jreback
Copy link
Contributor

jreback commented Jun 24, 2016

@hcontrast can you update

@jreback
Copy link
Contributor

jreback commented Jun 30, 2016

can you rebase and update

@jreback
Copy link
Contributor

jreback commented Jul 6, 2016

can you rebase & update

@jreback jreback modified the milestones: Next Major Release, 0.18.2 Jul 6, 2016
@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

@jreback
Copy link
Contributor

jreback commented Nov 16, 2016

closing, but pls reopen / comment if you can continue

@jreback jreback closed this Nov 16, 2016
@jreback
Copy link
Contributor

jreback commented Nov 16, 2016

this is almost there .....

@paul-mannino
Copy link
Contributor

paul-mannino commented Jan 20, 2018

Want to knock this one out in conjunction with #18447. Just one question, what's the general desired behavior for missing data in concatenations? Should existing NaTs and nans always remain as they are, and otherwise use whatever the fill value is for the block for any new missing data created by the concatenation?

@jreback
Copy link
Contributor

jreback commented Jan 20, 2018

nan is always by type
eg NaT for daetimelile and nan otherwise

the tests look pretty good here

@jreback
Copy link
Contributor

jreback commented Jan 20, 2018

go ahead and open a new PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concatenation of DFs with all NaT columns and TZ-aware ones breaks
5 participants